The Grammar of Graphics


ggplot2 is one of the core packages of the tidyverse, a collection of R packages designed for data science.

The “gg” stands for “Grammar of Graphics”, a book by Leland Wilkinson that offers tools to concisely describe the components of a graphic in statistics and computing.

The logic of ggplot2 stems from this idea: you can build every graph from the same few components, namely a data frame, visual marks (geoms) representing the data, and a coordinate system.

ggplot2 is more flexible and versatile than R’s base graphics, and once you get a grip on the syntax and function arguments, it becomes easy to create beautiful and elaborate visualizations.

As Hadley Wickham explained: “You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.”

Grammatical elements of ggplot2


A key feature of ggplot2 is that it allows you to layer graphical elements on top of each other. You can imagine a stack of layers, each adding onto the layers before it.

  • Data - the data frame we want to use for our plot
  • Aesthetics (aes) - the scales we want to map our data onto
  • Geometries (geom) - the geometric shapes representing our data
  • Themes - the appearance of the non-data aspects of the plot
  • Statistics - data representations to aid understanding
  • Coordinates/Scales - the range and limits of our plot
  • Facets - the layout of multiple plots and subplots

The first three elements (data, aesthetics (aes), and geometries (geom)) are the basic elements.
We must define them in every ggplot2 call in order to produce a meaningful plot.

The remaining elements are “optional”, that is, they are set to a default. This means we are not required to define them when we plot, though typically we would want to adjust them to make sure our graphs better fit our needs.

In this presentation I will focus mainly on the first three elements, and specifically on the most commonly used geoms.

Let’s get to work!

Installing packages


Begin by installing and loading the tidyverse package, which includes ggplot2 among other useful packages such as dplyr and tidyr, which are used for manipulating data prior to plotting.

You only need to install a package once, but you will need to “load” it every time you restart a session.

If you only want to install the ggplot2 package you can use a similar line of code, but you will most likely use dplyr as well, so you may as well install tidyverse, which includes both (and more).
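As a minimal sketch of the install-once, load-every-session pattern described above (it assumes a CRAN mirror is configured; the `requireNamespace()` guard is an addition for convenience, not part of the original slides):

```r
# Install once; the guard skips the download if tidyverse is already present
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
# For ggplot2 only, the call would be: install.packages("ggplot2")

# Load at the start of every R session
library(tidyverse)
```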

Our Data


For this exercise we will use the diamonds dataset, which ships with ggplot2, and the gapminder dataset from the gapminder package. Both are available in R.
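Loading the two datasets might look like the following sketch, which assumes the gapminder package has already been installed with `install.packages("gapminder")`:

```r
library(ggplot2)    # provides the diamonds dataset
library(gapminder)  # provides the gapminder dataset

# Both objects are now available as data frames
class(diamonds)
class(gapminder)
```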


We will start working with the diamonds data.

The first step should always be to examine the dataset. What variables do we have? What data type is each variable? How many observations are included?

You can use the structure function str(), or the summary function summary() if you want more details on each variable.
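The output that follows is consistent with calls along these lines (a sketch; the exact chunk from the original slides is not available):

```r
library(ggplot2)

diamonds           # printing a tibble shows the first ten rows
str(diamonds)      # structure: type and first values of each variable
summary(diamonds)  # per-variable summary statistics
```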

## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

If you only want the variable names for easier access, you can simply list the columns of the dataset using colnames().

##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"

Now let’s continue exploring the data by plotting it with ggplot2.

The ggplot2 syntax


The first line of code in ggplot2 requires us to input the data frame we intend to use and the aesthetics we want to map our data onto. This line typically includes all the data needed for creating the plot. The function syntax is written as: ggplot(data, aes())

For instance, to plot the price of diamonds based on their carat we need to set “diamonds” as the data, and map “carat” and “price” onto the x and y aesthetics.

The function can be written either as: ggplot(data = diamonds, aes(x = carat, y = price))
or simply as: ggplot(diamonds, aes(carat, price))

This creates the base layer of our plot, which includes the dimensions we defined for the aesthetics. In order to present the observations, we need to add geometric layers. For every layer we add, we need to place a “+” sign.

For instance, to present a trend line of the average price by carat, we can add a geom_smooth() layer. This geom fits a smoothed regression line with a confidence interval around it.
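A sketch of the base layer plus a smoothing layer. The object name `p_smooth` is only illustrative:

```r
library(ggplot2)

# Base layer (data + aesthetic mappings), then a smoothed trend line on top
p_smooth <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_smooth()
p_smooth
```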

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

However, a regression line is not very telling about the observations. In this instance, it would make more sense to create a scatterplot in order to see the spread of the observations. We can do this by adding a geom_point() layer.
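Swapping the smoothing layer for a point layer could look like this (object name illustrative):

```r
library(ggplot2)

# One point per diamond: carat on the x axis, price on the y axis
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()
p_scatter
```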

Scatterplots

Many of the observations are overlapping, making it difficult to see the actual distribution. To help remedy overplotting, we can adjust the transparency of the points by reducing the alpha and also increase the size of the points inside the geom_point layer.
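A possible version of that adjustment. The specific values `alpha = 0.2` and `size = 2` are assumptions chosen for illustration, not the values used in the original slides:

```r
library(ggplot2)

# Lower alpha makes points semi-transparent, so dense regions appear darker
p_alpha <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.2, size = 2)
p_alpha
```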

This looks better, but it is still difficult to draw insights from this plot. We can add another aesthetic mapping to differentiate between diamonds with different cuts. In this example we will map “cut” onto the color aesthetic in the ggplot line.
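Mapping cut onto color in the main ggplot() call might be written as follows (the alpha value is again an illustrative assumption):

```r
library(ggplot2)

# color = cut sits inside aes(), so each cut level gets its own color
p_cut <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.2)
p_cut
```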

We could define the aesthetic mapping inside the geom layer rather than in the ggplot line. We might choose to do so if we want to assign different aesthetic mappings to different geom layers, or if we are plotting values from different data frames.

In the example above, if we move the aes(color = cut) into the geom_point() layer, we will produce the same graph.
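The same plot with the color mapping moved into the geom layer, as a sketch:

```r
library(ggplot2)

# The mapping now belongs to geom_point() only, not to the whole plot
p_cut_layer <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(color = cut), alpha = 0.2)
p_cut_layer
```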

Finally, notice the difference between aesthetic mappings, which represent scales, and attributes, which represent fixed values.

Instead of assigning a fixed size, we could map size onto a variable.
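To make the distinction concrete, here is a hedged sketch. The color name and the choice of `depth` as the mapped variable are assumptions for illustration only:

```r
library(ggplot2)

# Attribute: fixed values, set outside aes()
p_fixed <- ggplot(diamonds, aes(carat, price)) +
  geom_point(color = "steelblue", size = 2, alpha = 0.2)

# Mapping: size varies with a variable, set inside aes()
p_mapped <- ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(size = depth), alpha = 0.2)

p_fixed
p_mapped
```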

Stacking Layers

As mentioned previously, ggplot2 enables us to add multiple element layers on top of each other.

We need to add another “+” sign at the end of each line to indicate that another layer follows.

When adding geom layers, each new layer will appear on top of the previous layers. This means the order of the geom layers matters.
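A sketch of two stacked geoms; here the trend line is added last so it is drawn on top of the points:

```r
library(ggplot2)

p_stack <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.2) +  # drawn first, ends up underneath
  geom_smooth()              # drawn second, ends up on top
p_stack
```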

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Notice that the aesthetic mappings defined in the first line are automatically adopted by all the geom layers. Aesthetics and attributes added to an individual geom layer affect only that layer, and they can override aesthetic mappings from the main ggplot() line.
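One way to see inheritance and overriding side by side (a sketch; the fixed color "black" is an assumption for illustration):

```r
library(ggplot2)

# color = cut is defined once and inherited by every layer
p_inherit <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.2) +
  geom_smooth()

# A fixed color on geom_smooth() overrides the inherited mapping
# for that layer only; the points keep their per-cut colors
p_override <- ggplot(diamonds, aes(carat, price, color = cut)) +
  geom_point(alpha = 0.2) +
  geom_smooth(color = "black")

p_inherit
p_override
```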

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

We can add multiple geom layers of the same type.

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 38 rows containing missing values (geom_smooth).

Each geom type has multiple arguments which are set to default values, and which we can easily change based on our needs. For instance, geom_point can take arguments relating to x, y, alpha, color, fill, shape, size, and stroke.

In the previous examples we changed the alpha and size attributes of the points.

RStudio publishes a ggplot2 cheat sheet that lists the geoms and their arguments.

Improving the plot

Before we continue, let’s make our lives a bit easier. Instead of typing the function over and over again, we can assign the function to an object and simply add layers to that object.

Now we can add layers and adjustments to “dd”, which already contains our predefined ggplot() + geom_point().
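A sketch of what the “dd” object might contain. The mappings and the alpha value are assumptions based on the earlier scatterplot examples:

```r
library(ggplot2)

# Store the base plot once...
dd <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.2)

# ...then extend it without retyping the base layers
dd + geom_smooth()
```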

Vertical & Horizontal Lines

We can add lines to indicate the median and mean of carat. To add vertical and horizontal lines we use geom_vline() and geom_hline() respectively.

We can also add labels to the lines with geom_text() to indicate what they represent.
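A possible implementation, sketched with a small helper data frame. The name `carat_stats`, the label position `y = 18000`, and the styling are hypothetical choices, not taken from the original slides:

```r
library(ggplot2)

dd <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.2)

# One row per reference line: its x position and its label
carat_stats <- data.frame(
  x     = c(median(diamonds$carat), mean(diamonds$carat)),
  label = c("median", "mean")
)

p_lines <- dd +
  geom_vline(data = carat_stats, aes(xintercept = x), linetype = "dashed") +
  geom_hline(yintercept = mean(diamonds$price), linetype = "dotted") +
  geom_text(data = carat_stats, aes(x = x, y = 18000, label = label),
            angle = 90, vjust = -0.5)
p_lines
```

Passing a separate data frame to geom_text() draws each label once; mapping labels from the full diamonds data would draw one label per row.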

We can see that much of the data is condensed on the left side of the plot. We can handle this by adjusting the data or, better yet, adjusting the scale.

Adjusting the data

Using dplyr functions, we can filter out observations greater than 3 carats. We’ll create a new dataset by saving the filtered data into an object called “smallD”.

We then plot the same aesthetics using the new data frame “smallD”.
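The filtering step could be sketched like this, assuming diamonds of exactly 3 carats are kept:

```r
library(dplyr)
library(ggplot2)

# Keep only diamonds of 3 carats or less
smallD <- diamonds %>%
  filter(carat <= 3)

p_small <- ggplot(smallD, aes(carat, price)) +
  geom_point(alpha = 0.2)
p_small
```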

Adjusting the scales

Instead of filtering out extreme observations, we can adjust the x axis, either by changing its limits with xlim(), or by log-transforming the x scale with scale_x_log10().

Limiting the scale removes the points outside the limit range.
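A sketch of limiting the axis, assuming limits of 0 and 3 to match the filtering example:

```r
library(ggplot2)

dd <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.2)

# Points outside the limits are dropped, and ggplot2 warns about it
p_xlim <- dd + xlim(0, 3)
p_xlim
```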

## Warning: Removed 32 rows containing missing values (geom_point).

Limiting the x scale for the diamonds dataset creates a graph that is identical to the one we created with the smallD data frame.


Log-transforming the scale keeps all the data points, but spaces the axis logarithmically, so each step along the axis multiplies the value rather than adding to it.

Log-transforming is useful when the data is very skewed, as in the case of the gapminder data, which I will go back to at the end of the presentation. For the diamonds dataset, I would probably choose to limit the axis scale rather than log-transform it.
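Applied to the same base plot, the log scale could be added like so (a sketch):

```r
library(ggplot2)

dd <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.2)

# All points are kept; only the spacing of the x axis changes
p_log <- dd + scale_x_log10()
p_log
```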

Facets & Themes


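We can further examine differences by arranging the data into subplots with facet_grid() and facet_wrap().

facet_grid() creates a matrix of panels defined by row and column variables, while facet_wrap() wraps a sequence of panels into rows and columns.

One of the benefits of creating subplots with facets is that the scales are shared across plots.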
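Both faceting functions in a short sketch. The choice of `cut` and `color` as faceting variables is an assumption for illustration:

```r
library(ggplot2)

base <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.2)

# One panel per cut level, wrapped into rows and columns
p_wrap <- base + facet_wrap(~ cut)

# A grid of panels: color in rows, cut in columns
p_grid <- base + facet_grid(color ~ cut)

p_wrap
p_grid
```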

Now let’s see what happens when we use geom_point() with a categorical x.

There is overplotting because all the observations are aligned on the same x value. ggplot2 enables us to “jitter” the points in order to overcome overplotting. We do this either by adding a geom_jitter() layer instead of a geom_point() layer, or alternatively, by adding a position argument to the geom_point() line as follows:

geom_point(position = "jitter")
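The overplotting and its two equivalent fixes might be sketched as follows, assuming `cut` as the categorical x and `price` as y:

```r
library(ggplot2)

# Overplotted: every point sits exactly on its cut level
p_cat <- ggplot(diamonds, aes(cut, price)) +
  geom_point(alpha = 0.2)

# Fix 1: a dedicated jitter geom
p_jitter1 <- ggplot(diamonds, aes(cut, price)) +
  geom_jitter(alpha = 0.2)

# Fix 2: the same effect through the position argument
p_jitter2 <- ggplot(diamonds, aes(cut, price)) +
  geom_point(position = "jitter", alpha = 0.2)

p_cat
p_jitter1
```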

We focused on many examples with scatterplots (geom_point()), but the logic of the function arguments and layers is applicable to the other geom types as well.

Bar Charts

The height of the bars in geom_bar() represents the number of cases in each group. Thus it only takes an “x” aesthetic.

The height of the bars in geom_col() represents other values in the data, which is why it also requires a “y” aesthetic.

Alternatively, you could change the stat argument inside geom_bar() to identity, as in geom_bar(stat = "identity"), which will enable it to take a “y” aesthetic as well.
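A sketch contrasting the two bar geoms. The summary table `avg_price` and its column `mean_price` are hypothetical names introduced for this example:

```r
library(dplyr)
library(ggplot2)

# geom_bar(): counts the rows in each cut level, so only x is needed
p_bar <- ggplot(diamonds, aes(cut)) +
  geom_bar()

# geom_col(): bar heights come from a y variable in the data
avg_price <- diamonds %>%
  group_by(cut) %>%
  summarize(mean_price = mean(price))

p_col <- ggplot(avg_price, aes(cut, mean_price)) +
  geom_col()  # same result as geom_bar(stat = "identity")

p_bar
p_col
```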

Assigning the color aesthetic would change the color of the outlines rather than the fill of the bars. To change the color of the bars we use the fill aesthetic.

By assigning the fill to another variable, we split each bar into subgroups.

The default position is set to “stack”, which is why the cut levels are stacked upon each other. The other options are position = “fill”, which stretches each bar to represent 100%, and position = “dodge”, which places the groups next to each other.
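The three position settings in one sketch. Using `color` on the x axis with `cut` as the fill variable is an assumption for illustration:

```r
library(ggplot2)

base_bar <- ggplot(diamonds, aes(color, fill = cut))

p_stacked <- base_bar + geom_bar()                    # default: position = "stack"
p_filled  <- base_bar + geom_bar(position = "fill")   # each bar scaled to 100%
p_dodged  <- base_bar + geom_bar(position = "dodge")  # groups side by side

p_stacked
p_filled
p_dodged
```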

Finally, you can also change the direction of the bars by flipping them 90 degrees with coord_flip(), or draw them in a circular layout with coord_polar().
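Both coordinate adjustments applied to a simple bar chart, as a sketch:

```r
library(ggplot2)

bars <- ggplot(diamonds, aes(cut, fill = cut)) +
  geom_bar()

p_flipped <- bars + coord_flip()   # horizontal bars
p_polar   <- bars + coord_polar()  # bars drawn around a circle

p_flipped
p_polar
```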

Histograms

Histograms are used for continuous x variables, as opposed to bar charts, which are used for categorical variables.

Coming soon

Line graphs

Line graphs produced by geom_line() are suitable for longitudinal data in which we want to show change over time, or between different treatments. For the diamonds data, a line graph will look like a hot mess.

To better demonstrate the line geom, we will move to the gapminder data, which contains information on the life expectancy of countries over the past seven decades.

The Gapminder data


## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

Let’s see what happens when we plot life expectancy by year using geom_line().
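A sketch of that first attempt (requires the gapminder package):

```r
library(ggplot2)
library(gapminder)

# A single line is drawn through every observation, year by year
p_line <- ggplot(gapminder, aes(year, lifeExp)) +
  geom_line()
p_line
```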

This graph is not very telling because it is basically going through all the data points on each year. We can add a color aesthetic to create subgroups for each continent. Let’s check if that helps.
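Adding the continent mapping could look like this (a sketch):

```r
library(ggplot2)
library(gapminder)

# One line per continent, but each still passes through every country
p_line_cont <- ggplot(gapminder, aes(year, lifeExp, color = continent)) +
  geom_line()
p_line_cont
```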

It still looks like a mess because the lines are still passing through all the data points of each year. What we want is lines that represent the average of each group, similar to the trend line produced by geom_smooth().
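For comparison, a trend line per continent could be sketched as:

```r
library(ggplot2)
library(gapminder)

# geom_smooth() fits one smoothed curve per continent
p_smooth_cont <- ggplot(gapminder, aes(year, lifeExp, color = continent)) +
  geom_smooth()
p_smooth_cont
```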

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

So basically, we need to create another variable that represents the average for each group by year. This is where dplyr becomes important and useful. We can easily create a new data frame from the gapminder data via dplyr functions to add the desired values. We first group_by() continent and year, and then we use summarize() to calculate the total population and the average life expectancy.

I saved this data frame in an object called “yearContinent”.
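A sketch that reproduces the column names shown in the output below. The `as.numeric()` conversion is an assumption: it avoids integer overflow when summing populations and matches the `<dbl>` type in the printed tibble:

```r
library(dplyr)
library(gapminder)

yearContinent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(
    totalPop       = sum(as.numeric(pop)),  # numeric sum avoids integer overflow
    AverageLifeExp = mean(lifeExp)
  )
yearContinent
```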

## # A tibble: 60 x 4
## # Groups:   year [12]
##     year continent   totalPop AverageLifeExp
##    <int> <fct>          <dbl>          <dbl>
##  1  1952 Africa     237640501           39.1
##  2  1952 Americas   345152446           53.3
##  3  1952 Asia      1395357351           46.3
##  4  1952 Europe     418120846           64.4
##  5  1952 Oceania     10686006           69.3
##  6  1957 Africa     264837738           41.3
##  7  1957 Americas   386953916           56.0
##  8  1957 Asia      1562780599           49.3
##  9  1957 Europe     437890351           66.7
## 10  1957 Oceania     11941976           70.3
## # … with 50 more rows

Now when we add a geom_line(), the lines represent the average for each continent.

We can also add a geom_point() on top of the lines so that the average value for each year is visually clearer.
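Plotting the summarized data might look like this sketch, which rebuilds the summary so the block runs on its own:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

yearContinent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(totalPop = sum(as.numeric(pop)),
            AverageLifeExp = mean(lifeExp))

# One line per continent, with a point marking each yearly average
p_avg <- ggplot(yearContinent, aes(year, AverageLifeExp, color = continent)) +
  geom_line() +
  geom_point()
p_avg
```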

Final notes

Now let’s go back to the gapminder data and see what it looks like when we use geom_point() to create a scatterplot by year.

This happens because year takes only twelve distinct values, so the points line up in columns much as they would for a categorical variable. Remember the jitter option for geom_point(), which “jitters” the points to avoid overplotting?
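A sketch of the jittered version. The `width` and `alpha` values are assumptions chosen for illustration:

```r
library(ggplot2)
library(gapminder)

# Spread the points horizontally around each survey year
p_year <- ggplot(gapminder, aes(year, lifeExp)) +
  geom_jitter(width = 1, alpha = 0.4)
p_year
```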

We can also facet the data by year to see variations by year.

Remember that log-transforming scales helps when the data is very skewed. This data is much more skewed than the diamonds data.

Now let’s put everything we learned together.

First we’ll assign the main function and the log scale to an object named “gm” (short for gapminder).

Now we will plot the object and add faceting, themes, and labels.
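One way the final plot could be assembled. The mapped variables (`gdpPercap`, `lifeExp`, `continent`), the theme, and the label text are all assumptions, since the original chunk is not available:

```r
library(ggplot2)
library(gapminder)

# Base plot with a log-transformed x axis, stored for reuse
gm <- ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent)) +
  geom_point(alpha = 0.5) +
  scale_x_log10()

# Add faceting, a theme, and labels on top of the stored object
p_final <- gm +
  facet_wrap(~ year) +
  theme_minimal() +
  labs(
    title = "Life expectancy by GDP per capita",
    x     = "GDP per capita (log scale)",
    y     = "Life expectancy (years)",
    color = "Continent"
  )
p_final
```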

This is all for now

Thank you!

Tutorials

Continue learning and practicing ggplot2 on your own:


  1. Data Visualization - in “R for Data Science” - Hadley Wickham’s e-book guide to R
  2. The Complete ggplot2 Tutorial - Tutorial by Selva Prabhakaran
  3. Stack Overflow - Great forum for asking questions from the community
  4. DataCamp course - First lesson of each course is free
  5. Interactive charts - Convert your ggplot2 figures into interactive ones powered by plotly.js